Consistent Weighted Sampling
Abstract
We describe an efficient procedure for sampling representatives from a weighted set such that for any weightings S and T, the probability that the two choose the same sample is equal to the Jaccard similarity between them:

$$\Pr[\mathrm{sample}(S) = \mathrm{sample}(T)] = \frac{\sum_x \min(S(x), T(x))}{\sum_x \max(S(x), T(x))}$$

where sample(S) is a pair (x, y) with 0 < y ≤ S(x). The sampling process takes expected computation linear in the number of non-zero weights in S, independent of the weights themselves. Sampling computations of this form are commonly limited mainly by the required (pseudo) randomness, which must be carefully maintained and reproduced to provide the consistency properties. Whereas previous approaches require randomness dependent on the sizes of the weights, we use an expected number of bits per weight independent of the values of the weights themselves. Furthermore, we discuss and develop the implementation of our sampling schemes, reducing the requisite computation and randomness substantially in practice.
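To make the collision property concrete, here is a minimal Python sketch. It is not the procedure from this paper: it follows Ioffe's Improved Consistent Weighted Sampling (ICWS), the variant named in the related entries below, which targets the same guarantee. The function names (`icws_sample`, `weighted_jaccard`, `estimate_similarity`) and the SHA-256 seeding scheme are assumptions made for illustration. The seeding only ensures that every weighting sees the same random draws for a given element and sample index, which is what makes the samples consistent.

```python
import hashlib
import math
import random


def _rng(sample_index, element):
    """Random source that depends only on (sample_index, element).

    Assumed seeding scheme: hash the pair with SHA-256 and seed Python's
    generator with it, so the draws never depend on the weights.
    """
    key = f"{sample_index}:{element!r}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def icws_sample(weights, sample_index=0):
    """Return a pair (x, y) with 0 < y <= weights[x] (ICWS-style sketch)."""
    best = None
    for x, w in weights.items():
        if w <= 0:
            continue  # only non-zero weights take part
        rng = _rng(sample_index, x)
        r = rng.gammavariate(2.0, 1.0)
        c = rng.gammavariate(2.0, 1.0)
        beta = rng.random()
        # Snap the weight down onto a random geometric grid with ratio e^r.
        t = math.floor(math.log(w) / r + beta)
        y = math.exp(r * (t - beta))
        # Hash value for this element; the smallest one wins.
        a = c / (y * math.exp(r))
        if best is None or a < best[0]:
            best = (a, x, y)
    if best is None:
        raise ValueError("weighting has no positive weights")
    return best[1], best[2]


def weighted_jaccard(s, t):
    """Exact value of sum_x min(S(x), T(x)) / sum_x max(S(x), T(x))."""
    keys = set(s) | set(t)
    num = sum(min(s.get(k, 0.0), t.get(k, 0.0)) for k in keys)
    den = sum(max(s.get(k, 0.0), t.get(k, 0.0)) for k in keys)
    return num / den


def estimate_similarity(s, t, num_samples=2000):
    """Fraction of independent sample indices on which both samples agree."""
    hits = sum(
        icws_sample(s, i) == icws_sample(t, i) for i in range(num_samples)
    )
    return hits / num_samples


if __name__ == "__main__":
    S = {"a": 2.0, "b": 1.0}
    T = {"a": 1.0, "b": 1.0, "c": 2.0}
    print("exact    :", weighted_jaccard(S, T))
    print("estimated:", estimate_similarity(S, T))
```

Each call does a constant amount of work per non-zero weight, so the cost is linear in the number of non-zero weights regardless of their magnitudes. The empirical collision rate should approach the exact weighted Jaccard value as `num_samples` grows.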
Similar resources
Median Estimation in Sample Surveys
In a recent paper Maritz and Jarrett (1978) proposed a small-sample estimate of the variance of sample medians from continuous populations. In this paper their methods are adapted to median estimation in stratified sampling without replacement from finite populations. A weighted sample median for estimating the median of heavy-tailed or skewed populations is proposed. Its asymptotic normal distri...
Weighted Likelihood for Semiparametric Models and Two-phase Stratified Samples, with Application to Cox Regression
Weighted likelihood, in which one solves Horvitz-Thompson or inverse probability weighted (IPW) versions of the likelihood equations, offers a simple and robust method for fitting models to two-phase stratified samples. We consider semiparametric models for which solution of infinite dimensional estimating equations leads to √N-consistent and asymptotically Gaussian estimators of both Euclidea...
Weighted Empirical Likelihood in Some Two-sample Semiparametric Models with Various Types of Censored Data
In this article, the weighted empirical likelihood is applied to a general setting of two-sample semiparametric models, which includes biased sampling models and case-control logistic regression models as special cases. For various types of censored data, such as right censored data, doubly censored data, interval censored data and partly interval-censored data, the weighted empirical likelihoo...
Consistent Weighted Sampling Made More Practical
Min-Hash, which is widely used for efficiently estimating similarities of bag-of-words represented data, plays an increasingly important role in the era of big data. It has been extended to deal with real-value weighted sets – Improved Consistent Weighted Sampling (ICWS) is considered as the state-of-the-art for this problem. In this paper, we propose a Practical CWS (PCWS) algorithm. We first ...
Improved Consistent Weighted Sampling Revisited
Min-Hash is a popular technique for efficiently estimating the Jaccard similarity of binary sets. Consistent Weighted Sampling (CWS) generalizes the Min-Hash scheme to sketch weighted sets and has drawn increasing interest from the community. Due to its constant-time complexity independent of the values of the weights, Improved CWS (ICWS) is considered as the state-of-the-art CWS algorithm. In ...